Abstract

We consider how to store an array of length $n$ such that we can quickly list all the distinct
elements whose relative frequencies in a given range are greater than a given threshold $\tau$. We
give the first linear-space data structure that answers queries in worst-case optimal $O(1/\tau)$ time,
assuming $\tau$ is pre-specified. We then consider the more challenging case when $\tau$ is given at query
time. The best previous bounds for this problem are $O(n \lg \sigma)$ space and $O(1/\tau)$ time, where
$\sigma \le n$ is the number of distinct elements in the array. We show how to use either $n \lg\lg\sigma + O(n)$
space and $O(1/\tau)$ time, linear space and $O((1/\tau) \lg\lg(1/\tau))$ time, or compressed space and
$O((1/\tau) \lg\lg\sigma)$ time. We also consider how to find a single element whose relative frequency
in the range is below the threshold (range minority), or determine that no such element exists.
We show how to modify a linear-space data structure for this second problem by Chan et al.
(2012) so that it uses compressed space and almost the same time $O(1/\tau)$. We obtain in passing
some results of independent interest, for example density-sensitive query time for one-dimensional
range counting.
1 Introduction

Parameterized range-majority (PRMaj) queries have been the subject of recent research,
e.g., [THIIHIIH] and the papers mentioned below. For this problem we are asked to preprocess
an array $A$ such that later, given $i$ and $j$, we can quickly list all the distinct elements whose
relative frequency in $A[i..j]$ is above a given threshold $\tau$. Karpinski and Nekrich [18] showed
how, if $\tau$ is given to us at the same time as $A$, then we can store $A$ in $O(n/\tau)$ space on
a RAM with $\Omega(\lg n)$-bit words and answer queries in $O((1/\tau)(\lg\lg n)^2)$ time. Durocher et
al. [11] reduced the space bound to $O(n \lg(1/\tau + 1))$ words and the time bound to $O(1/\tau)$.
Since there could be $1/\tau$ elements to list, this time bound is worst-case optimal. Gagie
et al. [15] showed how to store $A$ either in $O(n(H + 1))$ words with query time $O(1/\tau)$
or in $O(n(H + 1))$ bits with query time $O((1/\tau) \lg\lg\sigma)$, where $H \le \lg\sigma$ is the entropy
of the distribution of elements in $A$ and $\sigma \le n$ is the number of distinct elements in $A$.
Throughout, we assume the distinct elements in $A$ are $\{1, \ldots, \sigma\}$; otherwise, they can be
mapped to that set using a simple translation array. Gagie et al. also showed that, for
their first data structure, $\tau$ can be given with $i$ and $j$ at query time. That is, we can
store $A$ in $O(n(H + 1))$ words such that later, given $i$, $j$ and $\tau$, in $O(1/\tau)$ time we can list
all the distinct elements that occur at least $\tau(j - i + 1)$ times in $A[i..j]$. Chan et al. [10]
independently gave another data structure that can handle variable $\tau$, but it takes $O(n \lg n)$
words and the same time. These are the only previous solutions for the important case of
variable $\tau$.
Table 1 Results for the PRMaj problem on an array of length $n$ containing $\sigma$ distinct elements,
in which the distribution of elements has entropy $H \le \lg\sigma$. The solutions in lines 3 and 9 are
actually compressed.

Source                        Space (words)               Time (order)                  Variable $\tau$
Karpinski and Nekrich [18]    $O(n/\tau)$                 $(1/\tau)(\lg\lg n)^2$        no
Durocher et al. [11]          $O(n \lg(1/\tau))$          $1/\tau$                      no
Gagie et al. [15]             $O(n)$                      $(1/\tau) \lg\lg\sigma$       no
Theorem 2                     $O(n)$                      $1/\tau$                      no
Gagie et al. [15]             $O(n(H + 1))$               $1/\tau$                      yes
Chan et al. [10]              $O(n \lg n)$                $1/\tau$                      yes
Theorem 5                     $n \lg\lg\sigma + O(n)$     $1/\tau$                      yes
Theorem 11                    $O(n)$                      $(1/\tau) \lg\lg(1/\tau)$     yes
Theorem 8                     $n + o(n)$                  $(1/\tau) \lg\lg\sigma$       yes
In this paper we start by giving the first linear-space data structure for the original
PRMaj problem (i.e., when $\tau$ is pre-specified at construction time) that answers queries in
worst-case optimal $O(1/\tau)$ time. We then turn our attention to the more challenging case
when $\tau$ is given at query time. We show how to store $n \lg\lg\sigma + O(n)$ words such that later,
given $i$, $j$ and $\tau$, in $O(1/\tau)$ time we can list all the distinct elements that occur at least
$\tau(j - i + 1)$ times in $A[i..j]$. Note the space is near-linear and the time is still worst-case
optimal. We then give another solution that represents $A$ using only $nH + o(n) \lg\sigma$ bits,
and solves queries in the slightly higher time $O((1/\tau) \lg\lg\sigma)$. This is the first succinct (i.e.,
$n + o(n)$ words) solution to the problem (for fixed or variable $\tau$), and is also compressed. In
the full version of this paper we will reduce its space bound further, to $nH + o(n(H + 1))$
bits, making our solution fully compressed. Finally, we give a third solution with $O(n)$
words of space and time $O((1/\tau) \lg\lg(1/\tau))$. Table 1 summarizes previous results and our
own. In our time complexities, $1/\tau$ actually stands for $\min(1/\tau, \sigma)$. To measure our
space in words we assume the computer word contains exactly $\lceil \lg n \rceil$ bits.

Chan et al. [10] also introduced the related problem of parameterized range minority
(PRMin). For this problem we are asked to preprocess $A$ such that, given $i$, $j$ and $\tau$, we
can quickly find a single element that occurs in $A[i..j]$ but at most $\tau(j - i + 1)$ times, or
determine that no such element exists. They gave a data structure that takes $O(n)$ words
and $O(1/\tau)$ time. In this paper we improve the space bound to $(1 + \epsilon)nH + O(n)$ bits for
any constant $\epsilon > 0$. We also give another solution that takes only $nH + O(n) + o(nH)$ bits
but $O((1/\tau)\alpha(n))$ time, where $\alpha(n)$ is the very slowly-growing inverse Ackermann function
of $n$. This data structure is both compressed and the first succinct solution for PRMin. In
the full version of this paper we will reduce the space bound further, to $nH + o(n(H + 1))$
bits, making our solution fully compressed.

Along the way we obtain some results of independent interest. Perhaps the most relevant
one is a density-sensitive data structure for the one-dimensional range counting problem:
Given $n$ points in a discrete universe, we can build a linear-space data structure that can
compute the number occ of points inside any range $[i, j]$ in time $O\left(\lg\lg\frac{j-i+1}{\mathrm{occ}+1}\right)$.
2 Parameterized Range Majority

Our data structures are based on the well-studied operations of access, rank and select. If $S$
is a sequence (e.g., a one-dimensional array) then $S.\mathrm{rank}_a(k)$ is the number of occurrences
of $a$ in $S[1..k]$ and $S.\mathrm{select}_a(k)$ is the position of the $k$th occurrence of $a$ in $S$. The data
structures we use for access, rank and select in this section are by Pătraşcu [25] for binary
sequences, by Ferragina et al. [13] for sequences over polylogarithmic alphabets, and by
Barbay et al. [3] for general sequences (there are slightly better choices for the last two, but
these simple ones are sufficient for us). For the first two cases we can support the three
operations in constant time and space $nH + o(n)$ bits, where $H$ is the zero-order entropy
of $S$ (for bitmaps, the $o(n)$ term is as good as $O(n/\lg^c n)$ for any constant $c$). For large
alphabets, we use $nH + o(n(H + 1))$ bits and can have access or select in constant time, and
the other two operations in time $O(\lg\lg\sigma)$. Moreover, we can add $O(n \lg\lg\sigma)$ bits in order
to support in constant time partial rank queries, which are of the form $S.\mathrm{rank}_{S[k]}(k)$ [1].

The main idea in this section is to precompute, for a sampling of the possible intervals, 
the frequent elements in those intervals. Then we process the query range using a few 
sampled intervals that contain it. Interval sizes and frequent elements are chosen so that 
the latter are sufficiently few to allow us to check one by one whether they are a majority in
the query range. If we spend sufficient space to spot the first occurrence of those elements 
in the query range, then we achieve optimal time by checking them using only select and 
partial rank, in constant time. If we use less space, instead, we can only spot one arbitrary 
occurrence in the query range, and need to resort to the slower rank operation. 

2.1 Linear space and optimal time for pre-specified $\tau$

Let $C[1..n]$ be the array in which $C[k] = A.\mathrm{select}_{A[k]}(A.\mathrm{rank}_{A[k]}(k) - 1)$ if $A[k]$ is not the first
occurrence of that distinct element in $A$, or $0$ if it is. Notice that, if $i \le k \le j$, then $C[k] < i$ if
and only if $A[k]$ is the first occurrence of that distinct element in $A[i..j]$. Muthukrishnan [21]
showed that, if we store a data structure supporting range-minimum queries (RMQs) on $C$
in $O(1)$ time then later, given $i$ and $j$, we can list the positions of the first occurrences in
$A[i..j]$ of the distinct elements in $A[i..j]$ in time proportional to the number of such elements.
Sadakane [23] adapted Muthukrishnan's approach to the case when we have access to $A$ but
not $C$ and the RMQ data structure returns only the position of the range minimum in $C$,
in $O(1)$ time. Fischer [JJ] showed that such an RMQ data structure takes only $2n + o(n)$
bits.
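
The definition of $C$ and the recursive use of range-minimum queries can be illustrated with a small sketch. It is not the structure described above: the names `build_prev_array` and `first_occurrences` are hypothetical, positions are 1-based as in the text, and the range minimum is found by a linear scan that stands in for a constant-time RMQ structure.

```python
def build_prev_array(A):
    """Return C (1-based, C[0] unused): C[k] is the position of the previous
    occurrence of A[k] in A, or 0 if A[k] is the first occurrence."""
    last = {}
    C = [0] * (len(A) + 1)
    for k, a in enumerate(A, start=1):
        C[k] = last.get(a, 0)
        last[a] = k
    return C


def first_occurrences(C, i, j):
    """List the positions k in [i..j] with C[k] < i, i.e. the first
    occurrences in A[i..j] of the distinct elements of A[i..j]."""
    out = []
    stack = [(i, j)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Naive range minimum; an assumption standing in for an O(1)-time RMQ.
        m = min(range(lo, hi + 1), key=lambda k: C[k])
        if C[m] < i:
            # m is a first occurrence; both sides may contain more of them.
            out.append(m)
            stack.append((lo, m - 1))
            stack.append((m + 1, hi))
        # Otherwise no position in [lo..hi] has C[k] < i, so we stop here.
    return sorted(out)
```

With a constant-time RMQ in place of the scan, each reported position costs a constant number of queries, which is where the output-sensitive bound comes from.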

► Lemma 1. Suppose we are given $A$, $\tau$ and a value $b \le \lfloor \lg n \rfloor$. Then we can store $O(n)$
more bits such that later, given $i$ and $j$ with $\lfloor \lg(j - i + 1) \rfloor = b$, in $O(1/\tau)$ time we can build
a list of $O(1/\tau)$ positions that includes the position of the first occurrence in $A[i..j]$ of each
distinct element that occurs at least $\tau(j - i + 1)$ times in $A[i..j]$.

Proof. We store a bitvector $B[1..n]$ with $B[k] = 1$ if $A[k]$ occurs at least $\tau 2^b$ times in
$A[k..k + 2^{b+1} - 1]$, and $B[k] = 0$ otherwise. Let $A'$ be the array consisting of the elements of
$A$ indicated by 1s in $B$ (so $A'[k] = A[B.\mathrm{select}(k)]$) and let $C'$ be the array in which
$C'[k] = A'.\mathrm{select}_{A'[k]}(A'.\mathrm{rank}_{A'[k]}(k) - 1)$ if $A'[k]$ is not the first occurrence of that distinct element
in $A'$, or $0$ if it is. We store neither $A'$ nor $C'$ explicitly (although we can use $A$ and $B$ to
support access to $A'$ in $O(1)$ time) but only an RMQ data structure over $C'$. Together, $B$
and this RMQ data structure take $O(n)$ bits.

Given $i$ and $j$, we compute $i' = B.\mathrm{rank}(i - 1) + 1$ and $j' = B.\mathrm{rank}(j)$ and use Sadakane's
algorithm and the RMQ data structure for $C'$ to list the positions of the first occurrences in
$A'[i'..j']$ of the distinct elements in $A'[i'..j']$. Once we have these positions, we use select on
$B$ to map them back to positions in $A$. Notice that any distinct element in $A'[i'..j']$ occurs
at least $\tau 2^b$ times in $A[i..j + 2^{b+1} - 1]$, so there are $O(1/\tau)$ of them. It follows that we
use a total of $O(1/\tau)$ time to build a list of $O(1/\tau)$ positions in $A$. The first occurrence in
$A[i..j]$ of any element that occurs at least $\tau(j - i + 1)$ times in $A[i..j]$ is marked by a 1 in
$B$; therefore, that position is in our list. ◀

► Theorem 2. Given $A$ and $\tau$, we can store $A$ in linear space such that later, given $i$ and
$j$, in $O(1/\tau)$ time we can list all the distinct elements that occur at least $\tau(j - i + 1)$ times
in $A[i..j]$.

Proof. We apply Lemma 1 to $A$ for each value of $b$ between $0$ and $\lfloor \lg n \rfloor$, which takes a
total of $O(n \lg n)$ bits, so $O(n)$ words. We also store a single linear-space data structure
supporting access, select and partial rank on $A$ in $O(1)$ time. Given $i$ and $j$, in $O(1/\tau)$ time
we build a list of $O(1/\tau)$ positions in $A[i..j]$ that includes the position of the first occurrence
in $A[i..j]$ of each distinct element that occurs at least $\tau(j - i + 1)$ times in $A[i..j]$. For each
position $k$ in this list, we return $A[k]$ if $A.\mathrm{select}_{A[k]}(A.\mathrm{rank}_{A[k]}(k) + \lceil \tau(j - i + 1) \rceil - 1) \le j$.
Repetitions are avoided by marking a bitmap of length $\sigma \le n$. ◀
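
The final check in the proof, one select and one partial rank per candidate, can be sketched as follows. This is an illustration under assumptions: `OccurrenceIndex` and `is_frequent` are hypothetical names, select is backed by plain Python lists instead of a succinct structure, and a select beyond the last occurrence is taken to return infinity.

```python
import math
from collections import defaultdict


class OccurrenceIndex:
    """Naive stand-in for access, select and partial rank on A (1-based)."""

    def __init__(self, A):
        self.A = [None] + list(A)
        self.positions = defaultdict(list)      # element -> sorted positions
        self.partial_rank = [0] * (len(A) + 1)  # partial_rank[k] = rank_{A[k]}(k)
        for k, a in enumerate(A, start=1):
            self.positions[a].append(k)
            self.partial_rank[k] = len(self.positions[a])

    def select(self, a, r):
        """Position of the r-th occurrence of a, or infinity if there is none."""
        occ = self.positions[a]
        return occ[r - 1] if 1 <= r <= len(occ) else math.inf


def is_frequent(idx, k, i, j, tau):
    """Assuming k is the first occurrence of A[k] in A[i..j], decide whether
    A[k] occurs at least tau * (j - i + 1) times in A[i..j]."""
    need = math.ceil(tau * (j - i + 1))
    return idx.select(idx.A[k], idx.partial_rank[k] + need - 1) <= j
```

The test only works because $k$ is the first occurrence in the range: the element occurs often enough exactly when its occurrence number $\lceil \tau(j - i + 1) \rceil$, counted from $k$, still falls at or before $j$.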

In the full version of this paper we will give slightly tighter bounds. For example, we do
not need to consider $b < \lg(1/\tau)$ because, if $j - i + 1 = O(1/\tau)$, then we can afford to run on
$A[i..j]$ a linear-time algorithm by Misra and Gries [20] that lists the distinct elements that
occur at least $\tau(j - i + 1)$ times. Taking advantage of this, we can reduce our space bound to
$O(n(\lg n - \lg(1/\tau))) = O(n \lg(\tau n))$ bits, or $O(n \lg(\tau n)/\lg n)$ words, plus compressed space
to support access, select and partial rank on $A$.
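
For short ranges the text falls back on the Misra and Gries algorithm. A minimal sketch of that algorithm follows, with a second pass to confirm the surviving candidates; the function name `frequent_elements` and the choice of $\lfloor 1/\tau \rfloor$ counters are assumptions made for this illustration.

```python
import math


def frequent_elements(window, tau):
    """Return, in sorted order, the distinct elements occurring at least
    tau * len(window) times in window, using floor(1/tau) counters."""
    c = math.floor(1 / tau)
    counters = {}
    for x in window:
        if x in counters:
            counters[x] += 1
        elif len(counters) < c:
            counters[x] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    # Second pass: count the surviving candidates exactly.
    counts = dict.fromkeys(counters, 0)
    for x in window:
        if x in counts:
            counts[x] += 1
    need = tau * len(window)
    return sorted(x for x, cnt in counts.items() if cnt >= need)
```

With $c$ counters, every element occurring more than $m/(c+1)$ times in a window of length $m$ survives the first pass; since $c + 1 > 1/\tau$, this covers every element occurring at least $\tau m$ times.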



2.2 Near-linear space and optimal time 

We now turn our attention to the more challenging case of when $\tau$ is given at query time.
We reuse some of the techniques from Section 2.1 to reduce the space bound for this version
of the problem as well.

► Lemma 3. Suppose we are given $A$ and $C$ and a value $b \le \lfloor \lg n \rfloor$. Then we can store
$n \lg\lg\sigma + O(n)$ more bits such that later, given $i$, $j$ and $\tau$ with $\lfloor \lg(j - i + 1) \rfloor = b$, in $O(1/\tau)$
time we can build a list of $O(1/\tau)$ positions that includes the position of the first occurrence
in $A[i..j]$ of each distinct element that occurs at least $\tau(j - i + 1)$ times in $A[i..j]$.

Proof. We store an RMQ data structure over $C$, which takes $O(n)$ bits. If $1/\tau = \Omega(\sigma)$, then
we can simply list the positions of the first occurrences in $A[i..j]$ of all the distinct elements
in $A[i..j]$ using Muthukrishnan's algorithm in $O(\sigma)$ time. Therefore, we need consider only
the case when $1/\tau < \sigma$.

Let $T[1..n]$ be the array in which $T[k]$ is the smallest integer $t \le \lceil \lg\sigma \rceil$ such that $A[k]$
occurs at least $2^b/2^t$ times in $A[k..k + 2^{b+1} - 1]$, or $\infty$ if there is no such integer. We can store
$T$ in $n \lg\lg\sigma + o(n)$ bits with an instance of Ferragina et al.'s [13] multiary wavelet tree and
support access, rank and select on $T$ in $O(1)$ time since $T$ is over a polylogarithmic alphabet.
For $t \le \lceil \lg\sigma \rceil$, let $A_t$ and $C_t$ be the arrays whose lengths are the number of occurrences of
$t$ in $T$ and in which $A_t[p] = A[T.\mathrm{select}_t(p)]$ and $C_t[p] = C[T.\mathrm{select}_t(p)]$. Notice that we can
support access to $A_t$ and $C_t$ in $O(1)$ time using select on $T$ and access to $A$ and $C$, so we
need not store $A_t$ and $C_t$ explicitly. For each $t \le \lceil \lg\sigma \rceil$, we store an RMQ data structure
over $C_t$. Since the total length of $C_0, \ldots, C_{\lceil \lg\sigma \rceil}$ is at most $n$, all the RMQ data structures
together take a total of $O(n)$ bits.

Given $i$, $j$ and $\tau > 1/\sigma$, for each value $t \le \lceil \lg(1/\tau) \rceil$ we compute $i' = T.\mathrm{rank}_t(i - 1) + 1$
and $j' = T.\mathrm{rank}_t(j)$, then list the positions of the first occurrences in $A_t[i'..j']$ of the distinct
elements in $A_t[i'..j']$. Once we know the position $p$ of the first occurrence of a distinct
element in $A_t[i'..j']$, we can use $k = T.\mathrm{select}_t(p)$ to find the corresponding position in $A[i..j]$.
Each distinct element in $A_t[i'..j']$ occurs at least $2^b/2^t$ times in $A[k..k + 2^{b+1} - 1]$ for some
$i \le k \le j$, which is contained in $A[i..j + 2^{b+1} - 1]$, and thus there can be at most $4 \cdot 2^t$ of
them because $2^b \le j - i + 1 < 2^{b+1}$. It follows that we list a total of
$\sum_{t=0}^{\lceil \lg(1/\tau) \rceil} 4 \cdot 2^t = O(1/\tau)$ positions for all values $t$ together.

Any distinct element that occurs at least $\tau(j - i + 1)$ times in $A[i..j]$ occurs at least
$2^b/2^{\lceil \lg(1/\tau) \rceil} \le \tau 2^b \le \tau(j - i + 1)$ times in $A[i..j]$, so if $k \ge i$ is its leftmost occurrence in the
interval, it occurs at least $2^b/2^{\lceil \lg(1/\tau) \rceil}$ times in $A[k..j]$, which is contained in $A[k..k + 2^{b+1} - 1]$.
Thus, we list the position of the first occurrence in $A[i..j]$ of each such distinct element. ◀
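
To make the role of $T$ concrete, the sketch below computes it by brute force for one value of $b$ and then collects candidate positions for a query. It is only an illustration: `build_T` and `candidate_positions` are hypothetical names, the per-level RMQ structures are replaced by a left-to-right scan, and $T$ is held in a plain list rather than a wavelet tree.

```python
import math


def build_T(A, b, sigma):
    """T[k] (1-based, T[0] unused) is the smallest t <= ceil(lg sigma) such
    that A[k] occurs at least 2^b / 2^t times in A[k..k + 2^(b+1) - 1],
    or infinity if there is no such t."""
    t_max = math.ceil(math.log2(sigma)) if sigma > 1 else 0
    T = [math.inf] * (len(A) + 1)
    for k in range(1, len(A) + 1):
        window = A[k - 1 : k - 1 + 2 ** (b + 1)]
        occ = window.count(A[k - 1])
        for t in range(t_max + 1):
            if occ * 2 ** t >= 2 ** b:
                T[k] = t
                break
    return T


def candidate_positions(A, T, i, j, tau):
    """First occurrences in A[i..j] whose level T[k] is at most
    ceil(lg(1/tau)); a scan stands in for the RMQ structures over C_t."""
    t_query = math.ceil(math.log2(1 / tau))
    seen, out = set(), []
    for k in range(i, j + 1):
        a = A[k - 1]
        if a not in seen:
            seen.add(a)
            if T[k] <= t_query:
                out.append(k)
    return out
```

For a query range of length 6 we have $b = \lfloor \lg 6 \rfloor = 2$; the positions returned still need the select and partial rank check of Theorem 5 before an element is reported.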

► Corollary 4. Given $A$ and $C$, we can store $n \lg\lg\sigma + O(n)$ more words such that later, given
$i$, $j$ and $\tau$, in $O(1/\tau)$ time we can build a list of $O(1/\tau)$ positions that includes the position
of the first occurrence in $A[i..j]$ of each distinct element that occurs at least $\tau(j - i + 1)$
times in $A[i..j]$.

Proof. We apply Lemma 3 for all values of $b$ between $0$ and $\lfloor \lg n \rfloor$, which takes a total of
$n \lg n \lg\lg\sigma + O(n \lg n)$ bits or $n \lg\lg\sigma + O(n)$ words. ◀

► Theorem 5. We can store $A$ in $n \lg\lg\sigma + O(n)$ words such that later, given $i$, $j$ and $\tau$,
in $O(1/\tau)$ time we can list all the distinct elements that occur at least $\tau(j - i + 1)$ times in
$A[i..j]$.

Proof. We apply Corollary 4 to $A$ and store $O(n)$ words such that we can support access,
partial rank and select on $A$ in $O(1)$ time (using, say, two copies of the structure of Barbay
et al. [3] plus the partial rank structure [1]). Given $i$, $j$ and $\tau$, we build a list of $O(1/\tau)$
positions that includes the position of the first occurrence in $A[i..j]$ of each distinct element
that occurs at least $\tau(j - i + 1)$ times in $A[i..j]$. For each position $k$ in this list, we list $A[k]$ if
$A.\mathrm{select}_{A[k]}(A.\mathrm{rank}_{A[k]}(k) + \lceil \tau(j - i + 1) \rceil - 1) \le j$. Repetitions are avoided by marking
a bitmap of length $\sigma \le n$. ◀

2.3 Compressed space and near-optimal time 

► Lemma 6. Suppose we are given values $b \le \lfloor \lg n \rfloor$ and $t \le \lceil \lg n \rceil$ and a data structure
supporting access and rank on $A$ in $O(\lg\lg\sigma)$ time. Then we can store $O\left(\frac{n(b-t)}{2^{b-t}} + \frac{n}{\lg^3 n}\right)$
more bits such that later, given $i$, $j$ and $\tau$ with $\lfloor \lg(j - i + 1) \rfloor = b$ and $\lceil \lg(1/\tau) \rceil = t$, in
$O((1/\tau) \lg\lg\sigma)$ time we can list all the distinct elements that occur at least $\tau(j - i + 1)$ times
in $A[i..j]$.

Proof. We divide $A$ into blocks of length $2^{b-1}$ and store a bitvector $B[1..n]$ in which $B[k] = 1$
if, first, $A[k]$ is the first or last occurrence of that distinct element in that block and, second,
$A[k]$ occurs at least $2^b/2^t$ times in $A[k - 2^{b+1}..k + 2^{b+1}]$. At most $9 \cdot 2^t$ elements in any block
$A[\ell..\ell + 2^{b-1} - 1]$ are marked by 1s in $B$, because at most $9 \cdot 2^{t-1}$ distinct elements occur at
least $2^b/2^t$ times (each) in $A[\ell - 2^{b+1}..\ell + 2^{b-1} - 1 + 2^{b+1}]$. It follows that we can store $B$
in $O\left(\frac{n(b-t)}{2^{b-t}} + \frac{n}{\lg^3 n}\right)$ bits using Pătraşcu's [22] structure (recall that $nH = O(m \lg\frac{n}{m})$ on
a bitmap with $m$ 1s; here $m \le 9 \cdot 2^t n/2^{b-1}$).

Given $i$, $j$ and $\tau$, we find all the distinct elements in $A[i..j]$ marked by 1s in $B$. Since
$j - i + 1 < 2^{b+1}$, $A[i..j]$ overlaps at most 5 blocks, thus there are at most $45 \cdot 2^t = O(1/\tau)$
marked elements. For each such element $a$, we compute $A.\mathrm{rank}_a(j) - A.\mathrm{rank}_a(i - 1)$ and
output $a$ if that difference is at least $\tau(j - i + 1)$. This takes a total of $O((1/\tau) \lg\lg\sigma)$ time.

To see that our list is complete, suppose we should list an element $a$. Notice $A[i..j]$ must
include either the beginning or the end of the block containing the first occurrence of $a$ in
$A[i..j]$. If it includes the beginning, then that occurrence of $a$ is marked; otherwise, the last
occurrence of $a$ in the block is both in $A[i..j]$ and marked. Repeated elements are avoided
as before, using $\sigma$ extra bits. ◀
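
The marking rule and the rank-based query of this proof can be sketched with brute-force counting. The names `build_marks` and `range_majorities` are hypothetical, $b \ge 1$ is assumed so that the block length $2^{b-1}$ is an integer, and list slicing with `count` stands in for the rank structure.

```python
def build_marks(A, b, t):
    """B[k] = 1 (1-based, B[0] unused) if A[k] is the first or last occurrence
    of its value in its block of length 2^(b-1) and occurs at least 2^b / 2^t
    times in A[k - 2^(b+1) .. k + 2^(b+1)], clipped to the array."""
    n = len(A)
    block = 2 ** (b - 1)
    reach = 2 ** (b + 1)
    B = [0] * (n + 1)
    for start in range(1, n + 1, block):
        end = min(n, start + block - 1)
        first, last = {}, {}
        for k in range(start, end + 1):
            first.setdefault(A[k - 1], k)
            last[A[k - 1]] = k
        for k in set(first.values()) | set(last.values()):
            lo, hi = max(1, k - reach), min(n, k + reach)
            if A[lo - 1 : hi].count(A[k - 1]) * 2 ** t >= 2 ** b:
                B[k] = 1
    return B


def range_majorities(A, B, i, j, tau):
    """Verify each marked element of A[i..j] by counting its occurrences."""
    candidates = {A[k - 1] for k in range(i, j + 1) if B[k]}
    need = tau * (j - i + 1)
    return sorted(a for a in candidates if A[i - 1 : j].count(a) >= need)
```

Because a marked position need not be the first occurrence in the query range, the sketch counts occurrences directly, mirroring the use of rank rather than select and partial rank.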

Notice that, even if some occurrence of $a$ in $A[i..j]$ must be marked by a 1 in $B$, it
may not be the first such occurrence. This is why we use rank here, and not the faster
combination of partial rank and select that we used in the proof of Theorem 5.

► Corollary 7. Suppose we are given a value $t \le \lceil \lg n \rceil$ and a data structure supporting access
on $A$ in $O(1)$ time and rank on $A$ in $O(\lg\lg\sigma)$ time. Then we can store $O\left(\frac{n \lg\lg\lg\sigma}{\lg\lg\sigma} + \frac{n}{\lg^2 n}\right)$
more bits such that later, given $i$, $j$ and $\tau$ with $\lceil \lg(1/\tau) \rceil = t$, in $O((1/\tau) \lg\lg\sigma)$ time we
can list all the distinct elements that occur at least $\tau(j - i + 1)$ times in $A[i..j]$.

Proof. If $j - i + 1 = O((1/\tau) \lg\lg\sigma)$, then we can afford to run Misra and Gries' [20]
algorithm. Therefore, we need apply Lemma 6 only for values of $b$ between $\lfloor \lg(2^t \lg\lg\sigma) \rfloor$
and $\lfloor \lg n \rfloor$, which takes a total of $O\left(\frac{n \lg\lg\lg\sigma}{\lg\lg\sigma} + \frac{n}{\lg^2 n}\right)$ bits. ◀

► Theorem 8. We can store $A$ in $nH + o(n) \lg\sigma$ bits such that later, given $i$, $j$ and $\tau$, in
$O((1/\tau) \lg\lg\sigma)$ time we can list all the distinct elements that occur at least $\tau(j - i + 1)$ times
in $A[i..j]$. The structure also offers constant-time access to $A$ and $O(\lg\lg\sigma)$-time rank and
select.

Proof. We store $A$ in a data structure by Barbay et al. [3, 5] that takes $nH + o(n(H + 1))$
bits and supports access in constant time, and rank and select in $O(\lg\lg\sigma)$ time. If $1/\tau \ge \sigma$,
then we can afford to compute $A.\mathrm{rank}_a(j) - A.\mathrm{rank}_a(i - 1)$ for each distinct element $a$ in $A$.
Therefore, we need apply Corollary 7 only for values of $t$ between $0$ and $\lceil \lg\sigma \rceil$, which takes a
total of $O\left(\frac{n \lg\sigma \lg\lg\lg\sigma}{\lg\lg\sigma} + \frac{n}{\lg n}\right)$ bits. This is $o(n) \lg\sigma$ unless $\sigma = O(1)$, in which case we can
compute the frequency of each distinct symbol in the range, in a total of $O(\sigma \lg\lg\sigma) = O(1)$
time. ◀

In the full version of this paper we will reduce the space bound in Theorem 8 to $nH +
o(n(H + 1))$ bits. For those reviewers who are curious, we have included a sketch of the
ideas in Appendix A.

2.4 Linear space and nearer-optimal time 

We now give a third tradeoff by building on the same solution of Section 2.3. The idea is
to replace the compressed representation of $A$ by a linear-space one that allows us to count
the number of occurrences occ of any element in any range $A[i..j]$ in time $O\left(\lg\lg\frac{j-i+1}{\mathrm{occ}+1}\right)$.

To achieve it, we start by describing a one-dimensional range counting data structure, which
handles $n$ points in $[1, U]$ and can count the number occ of points in any range $[i, j]$ within
the same time. Such structures have independent interest.

► Theorem 9. Given $n$ points in the discrete universe $[1, U]$, there exists a data structure
using $O(n \lg U)$ bits that can return the number occ of points in any range $[i, j]$ in time
$$O\left(\lg\lg\frac{j-i+1}{\mathrm{occ}+1}\right).$$

Proof. In Sections 2.4.1 and 2.4.2.
2.4.1 Data structure

We use $\lceil \lg n \rceil - 1$ levels. At each level $\ell \ge 2$ we build a data structure that efficiently answers
queries of length between $2^{\ell-2} + 1$ and $2^{\ell-1}$. Our structure defines specific intervals and
subintervals. For clarity we will use the term ranges to denote any other range of the universe.



2.4.1.1 Data structure for level $\ell$

Given a level $\ell$, we divide the universe into $\lceil U/2^{\ell-1} \rceil$ overlapping intervals of size $2^\ell$, so that
interval number $k$ will be $[2^{\ell-1}k + 1, 2^{\ell-1}k + 2^\ell]$. It is clear that any range of size at most
$2^{\ell-1}$ will be included in at least one interval.

We only consider non-empty intervals. We can have at most $2n$ non-empty intervals,
as each point belongs to 2 intervals. We use a minimal perfect hash function $f_\ell$ that maps
the $n' \le 2n$ non-empty intervals into unique numbers in $[1, n']$. The minimal perfect hash
function [17] uses $O(n' + \lg\lg U) = O(n + \lg\lg U)$ bits of space and works in constant time.

We consider how to solve queries on non-empty intervals. Suppose that an interval $k$
contains $n_k$ elements. We cut the interval into equally-sized subintervals, of size $\lceil 2^\ell/n_k \rceil$
(the last subinterval can be shorter). We use a prefix sum data structure to store the
number of elements in each subinterval of the interval $k$, by appending the cardinalities of
the subintervals in a bitmap in unary and using select to get prefix sums. Such a prefix sum
data structure uses $O(n_k)$ bits. The space usage over all the prefix-sum data structures for
all the intervals is $O(n)$ bits. We concatenate the prefix-sum data structures of the intervals
(in the order given by the minimal perfect hash function) and store another bitmap $D_\ell$ that
marks the beginning of the prefix-sum data structure of each interval. This new bitmap also
uses $O(n)$ bits.
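
The unary-coded prefix sums can be pictured with a short sketch. The exact encoding is an assumption made here (each cardinality $c$ is written as $c$ ones followed by a zero), the function names are hypothetical, and select on zeros is done by a linear scan instead of a constant-time structure.

```python
def unary_prefix_bitmap(cardinalities):
    """Encode each subinterval cardinality c as c ones followed by one zero."""
    bits = []
    for c in cardinalities:
        bits.extend([1] * c)
        bits.append(0)
    return bits


def prefix_sum(bits, q):
    """Number of points in the first q subintervals: the position of the
    q-th zero (1-based) minus q. The scan stands in for select_0."""
    if q == 0:
        return 0
    zeros = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == 0:
            zeros += 1
            if zeros == q:
                return pos - q
    raise IndexError("fewer than q subintervals")
```

The bitmap has one bit per point plus one per subinterval, and an interval with $n_k$ points has about $n_k$ subintervals, which matches the $O(n_k)$ bits stated above.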

2.4.1.2 Global data structure 

Our global data structure consists of the following pieces, adding up to $O(n \lg U)$ bits.

1. The data structures of Section 2.4.1.1 built for levels $\ell = 2$ to $\ell = \lceil \lg n \rceil$. Each data
structure uses $O(n + \lg\lg U)$ bits of space, resulting in $O(n \lg n + \lg n \lg\lg U)$ bits overall.

2. A short-distance sensitive predecessor data structure for the set of points $S$ that, given an
element $x \in [1, U]$, returns the predecessor of $x$ in $O(\lg\lg d)$ time, where $d$ is the minimum
of the distances from $x$ to its predecessor and to its successor in $S$. This data structure
occupies $O(n \lg U)$ bits of space. It was described by Bose et al. [8, 9], and the space was
further improved recently by Belazzougui et al. [5].

3. A constant-time range-emptiness data structure for the set of points, by Alstrup et al.
[1]. This data structure is given a range $[i, j]$ and returns whether the range contains
at least one point or not. It also requires $O(n \lg U)$ bits of space.



2.4.2 Query answering 

We first do a range-emptiness query to determine whether the query range $[i, j]$ contains at
least one element. If not, we immediately return 0. Otherwise we compute $\ell = \lceil \lg(j - i +
1) \rceil + 1$. Then, in constant time, we determine the interval of level $\ell$ that encloses $[i, j]$ and
go to that level to answer the query. We denote that interval by $[I, J]$.

We first use $f_\ell$ to find the index $k$ of the interval $[I, J]$. Because we know that the interval
is non-empty, we are sure that the minimal perfect hash function gives a meaningful answer.
Next we use $D_\ell$ to recover the prefix-sum data structure for the interval $[I, J]$. Then, we
find the subinterval $[I_0, J_0]$ of $[I, J]$ that contains $i$ and the subinterval $[I_1, J_1]$ that contains
$j$. Notice that the count of the number of elements in $[i, j]$ equals the sum of the counts of
the number of elements in the three ranges $[i, J_0]$, $[J_0 + 1, I_1 - 1]$ and $[I_1, j]$. The count of
the range $[J_0 + 1, I_1 - 1]$ can be found in constant time using the prefix sum data structure
associated to interval $[I, J]$, as the range $[J_0 + 1, I_1 - 1]$ is aligned to subinterval boundaries.

What remains is to determine the counts in the two tail ranges $[i, J_0]$ and $[I_1, j]$. We
only show how to determine the count in range $[i, J_0]$; the other case is symmetric. First
we query the prefix-sum data structure in order to determine whether the subinterval $[J_0 -
\lceil 2^\ell/n_k \rceil + 1, J_0]$ is empty or not. If the subinterval is empty, then we conclude that the range
$[i, J_0]$ is also empty. Otherwise the subinterval $[J_0 - \lceil 2^\ell/n_k \rceil + 1, J_0]$ contains at least one
element, so we use the distance-sensitive predecessor data structure to count the number of
elements in $[i, J_0]$. That is, we call it to compute the predecessors of $i$ and $J_0$ respectively
and compute the count of elements by subtracting the rank of the predecessor of $i$ from the
rank of the predecessor of $J_0$. The query time of the distance-sensitive data structure will
be $O(\lg\lg \lceil 2^\ell/n_k \rceil)$, since the distance of any element in the subinterval to $i$ and to $J_0$ is at
most $\lceil 2^\ell/n_k \rceil$. Now note that $n_k$, the number of elements in $[I, J]$, is at least occ. On the
other hand $J - I + 1 = 2^\ell$ is at most $4(j - i + 1)$. We thus conclude that the query time is
$O\left(\lg\lg\frac{j-i+1}{\mathrm{occ}+1}\right)$, and Theorem 9 is proved.
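
The choice of level and enclosing interval at the start of the query can be checked with a few lines of integer arithmetic. The function name `enclosing_interval` and the formula $k = \lfloor (i - 1)/2^{\ell-1} \rfloor$ for the interval index are assumptions for this sketch; the text only states that the interval is found in constant time.

```python
def enclosing_interval(i, j):
    """Return (level, k, I, J) where level = ceil(lg(j - i + 1)) + 1 and
    [I, J] = [2^(level-1) * k + 1, 2^(level-1) * k + 2^level] encloses [i, j]."""
    size = j - i + 1
    level = (size - 1).bit_length() + 1   # ceil(lg size) + 1 for size >= 1
    half = 1 << (level - 1)               # 2^(level-1) >= size
    k = (i - 1) // half                   # assumed index formula
    I = half * k + 1
    J = half * k + (1 << level)
    return level, k, I, J
```

For example, the range $[5, 7]$ has size 3, so $\ell = 3$, and it is enclosed by interval $k = 1$ of that level, namely $[5, 12]$, whose size 8 is at most $4 \cdot 3$.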

Given that we are building on an unpublished result [5], we include in Appendix C a
data structure that needs $o(n \lg U)$ bits on top of the original array and answers predecessor
queries for an element $y \in [x, z]$, limited to the range $[x, z]$ and assuming such range is
not empty, in time $O(\lg\lg(z - x + 1))$. Our result in this section is also obtained with this
simpler data structure, applied on the interval given by $x = J_0 - \lceil 2^\ell/n_k \rceil + 1$ and $z = J_0$.



2.4.3 Back to the original problem 

The next result, also of independent interest, is the key to improving our time complexity.

► Theorem 10. We can store $A$ in $O(n)$ words so that later, given $i$, $j$ and $a$, we can return
the number occ of occurrences of element $a$ in $A[i..j]$ in time $O\left(\lg\lg\frac{j-i+1}{\mathrm{occ}+1}\right)$.

Proof. For each distinct element $a \in [1, \sigma]$ of $A$ we store the positions where $a$ occurs in
$A[1..n]$ using the data structure of Theorem 9. If $a$ occurs $n_a$ times, we use $O(n_a \lg n)$
bits, for a total of $O(n \lg n)$ bits, or $O(n)$ words. We also need $O(\sigma) = O(n)$ words for the
pointers to each data structure. To answer a query that asks for the number of occurrences
of $a$ in $[i, j]$, we use the data structure for $a$ and answer the corresponding range counting
query. ◀

In our specific PRMaj problem we have an interval $[i, j]$ and have a specific element $a$
that occurs at least once in $[i, j]$. Note that the element we want to count is marked only if
it appears at least $2^b/2^t$ times in $[i - 2^{b+1}, j + 2^{b+1}]$. This interval is of size at most $2^{b+3}$ and
contains at least $2^b/2^t$ elements. Thus to count the number of elements in $[i, j]$ we can use
level $\ell = b + 4$ so that we can be sure to find an interval that fully covers $[i - 2^{b+1}, j + 2^{b+1}]$.
The counting time will be $O(\lg\lg \lceil 2^\ell/n_k \rceil) = O(\lg\lg(2^4 2^t)) = O(\lg t) = O(\lg\lg(1/\tau))$. Thus we
have obtained our result.

► Theorem 11. We can store $A$ in $O(n)$ words such that later, given $i$, $j$ and $\tau$, in time
$O((1/\tau) \lg\lg(1/\tau))$ we can list all the distinct elements that occur at least $\tau(j - i + 1)$ times
in $A[i..j]$.
3 Parameterized Range Minority

The main idea in this section, following Chan et al. [10], is to find sufficiently many different
elements in the interval so that one of them must be a minority, if there is one. The problem
reduces to using appropriate data structures to compute the frequency of the candidates in
the query range using rank queries.

3.1 Nearly compressed space and no slowdown 

Recall from Section 2.1 that $C$ is the array in which $C[k] = A.\mathrm{select}_{A[k]}(A.\mathrm{rank}_{A[k]}(k) - 1)$ if $A[k]$
is not the first occurrence of that distinct element in $A$, or $0$ if it is. Therefore, if we have
a data structure that supports access, partial rank and select on $A$ in $O(f(n))$ time, then
we can support access to $C$ in $O(f(n))$ time. Note that Muthukrishnan's [21] technique to
find the distinct elements in $A[i..j]$ is real-time, that is, it gives the first $k$ answers in time
$O(k)$, for any $k$. Combining these observations yields the following lemma.

► Lemma 12. Suppose we have data structures that support access, partial rank and select
on $A$ and RMQs on $C$, all in $O(f(n))$ time. Then, given $i$ and $j$, we can list the positions
of the first occurrences in $A[i..j]$ of $k$ distinct elements in $A[i..j]$ using $O(f(n))$ worst-case
time per position listed, for any $k$.

Belazzougui and Navarro [6] showed how we can store a set of monotone minimal perfect
hash functions in $O(n) + o(nH)$ bits and support partial rank on $A$ in $O(1)$ time. Barbay et
al. [2] showed how, given any constant $\epsilon > 0$, we can store $A$ in $(1 + \epsilon)nH + o(n)$ bits such
that we can support access and select on $A$ in $O(1)$ time. This yields the next result.

► Theorem 13. Given any constant $\epsilon > 0$, we can store $A$ in $(1 + \epsilon)nH + O(n)$ bits such
that later, given $i$, $j$ and $\tau$, in $O(1/\tau)$ time we can find an element that occurs in $A[i..j]$ but
at most $\tau(j - i + 1)$ times, or determine that no such element exists.

Proof. We use Belazzougui and Navarro's and Barbay et al.'s data structures to store $A$
in $(1 + \epsilon)nH + O(n)$ bits such that we can support access, partial rank and select in $O(1)$
time. We store an instance of Fischer's RMQ data structure [13], which takes $O(n)$ bits,
so that we can support RMQs on $C$ in $O(1)$ time. Given $i$, $j$ and $\tau$, we use Lemma 12
to list the positions of the first occurrences in $A[i..j]$ of $\lceil 1/\tau \rceil$ distinct elements (or all of
them, if there are fewer). Notice there cannot be $\lceil 1/\tau \rceil$ distinct elements that all occur
more than $\tau(j - i + 1)$ times in $A[i..j]$. For each position $k$ in this list, we check whether
$A.\mathrm{select}_{A[k]}(A.\mathrm{rank}_{A[k]}(k) + \lceil \tau(j - i + 1) \rceil - 1) > j$ and, if so, return $A[k]$ and stop. This
takes a total of $O(1/\tau)$ time. ◀
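
The counting argument behind this proof (among $\lceil 1/\tau \rceil$ distinct elements of the range, at least one occurs at most $\tau(j - i + 1)$ times) can be sketched with brute-force counting. The name `range_minority` is hypothetical, and the scan and `count` calls stand in for Lemma 12 and the select and partial rank check; the sketch tests the condition "at most $\tau(j - i + 1)$ occurrences" directly.

```python
import math


def range_minority(A, i, j, tau):
    """Return an element occurring in A[i..j] (1-based) at most
    tau * (j - i + 1) times, or None if no such element exists."""
    wanted = math.ceil(1 / tau)
    seen, order = set(), []
    # Collect up to ceil(1/tau) distinct elements, as Lemma 12 would list them.
    for k in range(i, j + 1):
        a = A[k - 1]
        if a not in seen:
            seen.add(a)
            order.append(a)
            if len(order) == wanted:
                break
    limit = tau * (j - i + 1)
    for a in order:
        if A[i - 1 : j].count(a) <= limit:
            return a
    return None
```

If fewer than $\lceil 1/\tau \rceil$ distinct elements exist in the range, all of them are checked, so returning `None` is then a definite answer.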

3.2 Compressed space and nearly no slowdown 

In another paper, Belazzougui and Navarro [?] showed how we can store $A$ in $nH + o(n(H + 1))$
bits and support access and select in $O(f(n))$ time for any function $f(n) = \omega(1)$. If we use
this result in the proof of Theorem 13 instead of Barbay et al.'s, with $f(n) = \alpha(n)$, then we
obtain the new version below.

► Theorem 14. We can store $A$ in $nH + O(n) + o(nH)$ bits such that later, given $i$, $j$ and
$\tau$, in $O(\alpha(n)/\tau)$ time we can find an element that occurs in $A[i..j]$ but at most $\tau(j - i + 1)$
times, or determine that no such element exists.
In the full version of this paper we will reduce the space bound in Theorem 14 to $nH +
o(n(H + 1))$ bits. For those reviewers who are curious, we have included a sketch of the
ideas in Appendix B.

4 Conclusions 

We have given the first linear-space data structure for PRMaj with pre-specified $\tau$ and
$O(1/\tau)$ query time, which is worst-case optimal. Moreover, we have improved the space
bounds for parameterized range majority and minority in the important case of variable $\tau$.
For PRMaj, we can achieve nearly linear space with optimal time, or up to compressed space
with a slight slowdown. For PRMin, we can achieve nearly compressed space with time
$O(1/\tau)$, or compressed space with a slight slowdown. In the full version of this paper we
will show how to use fully compressed space with the same slowdowns. The slowdowns are
quite small, by factors of $O(\lg\lg\sigma)$ to $O(\alpha(n))$, but we would like to eliminate them
entirely, or show that this is impossible. We leave that as the main open problem from this paper.
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Appendices included for review purposes only. 
A Sketch of Fully-Compressed Parameterized Range Majority 

As in the proof of Theorem 8, we start by storing $A$ in a data structure by Barbay et 
al. [3, 5] that takes $nH + o(n(H + 1))$ bits and supports access in constant time, and rank 
and select in $O(\lg\lg \sigma)$ time. We can assume $\sigma = \omega(1)$ since, otherwise, we can compute the 
frequency of each distinct symbol in a given range, in a total of $O(\sigma \lg\lg \sigma) = O(1)$ time. 
Barbay et al.'s data structure is based on partitioning the set of distinct elements in $A$ 
according to their frequencies and storing data structures supporting access, rank and select 
on, first, an array $T[1..n]$ with $T[k] = \ell$ indicating that we place $A[k]$ in the $\ell$th subset; 
second, for each $\ell$, an array $S_\ell$ with $S_\ell[k]$ indicating the rank of $A[T.\mathrm{select}_\ell(k)]$ in the 
$\ell$th subset. 
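
The layout of $T$ and the arrays $S_\ell$ can be illustrated with a minimal Python sketch. The rule used here to assign symbols to subsets (the symbol with frequency rank $r$ goes to subset $\lfloor \lg r \rfloor$) and the names `alphabet_partition`, `members` and `access` are assumptions made for illustration; plain lists and linear scans stand in for the succinct access, rank and select structures.

```python
from collections import Counter

def alphabet_partition(A):
    """Split A into a class array T and one array S[l] per class l."""
    freq = Counter(A)
    # Symbols by decreasing frequency; ties broken by symbol value.
    order = sorted(freq, key=lambda s: (-freq[s], s))
    cls, local, members = {}, {}, {}
    for r, s in enumerate(order, start=1):
        l = r.bit_length() - 1                  # floor(lg r), assumed rule
        cls[s] = l
        local[s] = len(members.setdefault(l, []))   # rank of s inside class l
        members[l].append(s)
    T = [cls[a] for a in A]
    S = {l: [] for l in members}
    for a in A:
        S[cls[a]].append(local[a])
    return T, S, members

def access(T, S, members, k):
    """Recover A[k] (0-based) from T and the S arrays."""
    l = T[k]
    pos = T[:k + 1].count(l) - 1    # naive T.rank_l(k), shifted to 0-based
    return members[l][S[l][pos]]
```

For `A = list("abracadabra")` the sketch places `a` alone in subset 0, `b` and `r` in subset 1, and `c` and `d` in subset 2, so frequent symbols end up in small subsets.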

We add data structures such that we can support PRMaj queries on $T$ and each $S_\ell$. Let 
$A'$ be either $T$ or an array $S_\ell$, let $n'$ be the length of $A'$ and let $\sigma' \le \sigma$ be the number 
of distinct elements in $A'$. In the full version of this paper we will prove, by generalizing 
Corollary 7, that we can store $O\left(n' \lg \sigma' \cdot [\ldots] + \frac{n'}{\lg n'}\right)$ more bits such that we can answer 
PRMaj queries on $A'$ in $O((1/\tau) \lg\lg \sigma)$ time. For now, however, we use a clumsier but 
easier approach. 

If $A'$ is $T$ or $\sigma' \le \lg n$ then, as part of Barbay et al.'s data structure, we have $A'$ stored 
as a multiary wavelet tree that supports access, rank and select in $O(1)$ time; otherwise, 
we have it stored as an instance of a data structure by Golynski, Munro and Rao [TB] 
that supports access in $O(1)$ time and rank and select in $O(\lg\lg \sigma)$ time. Therefore, if 
$\sigma' \le \lg\lg \sigma$, we can compute the frequency of each distinct symbol in any given range, in a 
total of $O(\sigma') = O(\lg\lg \sigma)$ time. If $\sigma' > \lg\lg \sigma$, then we apply Corollary 7 to $A'$ for values 
of $\tau$ between 0 and $\lceil \lg \sigma' \rceil$. This way, we store 

$$O\left(\frac{n' \lg \sigma' \lg\lg\lg \sigma'}{\lg\lg \sigma'} + \frac{n'}{\lg n'}\right) = o\left(\frac{n' \lg \sigma' \lg\lg\lg\lg\lg \sigma}{\lg\lg\lg\lg \sigma} + \frac{n'}{\lg n'}\right)$$

more bits for $A'$ and can answer PRMaj queries on it in $O((1/\tau) \lg\lg \sigma')$ time. 

Given $i$, $j$ and $\tau$, we use a PRMaj query on $T$ to find the $O(1/\tau)$ distinct elements 
that each occur at least $\tau (j - i + 1)$ times in $T[i..j]$. For each such element $\ell$, we compute 
$i_\ell = T.\mathrm{rank}_\ell(i - 1) + 1$, $j_\ell = T.\mathrm{rank}_\ell(j)$ and $\tau_\ell = \tau \, \frac{j - i + 1}{j_\ell - i_\ell + 1}$. We then use a PRMaj query 
on $S_\ell$ to find the $O(1/\tau_\ell)$ distinct elements that each occur at least $\tau_\ell (j_\ell - i_\ell + 1)$ times in 
$S_\ell[i_\ell..j_\ell]$. We then map each of these distinct elements to the corresponding distinct 
element in $A$ using another multiary wavelet tree that is part of Barbay et al.'s data 
structure. All these operations together take a total of $O\left(\left(1/\tau + \sum_\ell 1/\tau_\ell\right) \lg\lg \sigma\right)$ time. 
Since $\sum_\ell (j_\ell - i_\ell + 1) \le j - i + 1$, we have $\sum_\ell 1/\tau_\ell \le 1/\tau$. Therefore, we use a total of 
$O((1/\tau) \lg\lg \sigma)$ time. 
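
The two-level query can be sketched in Python as follows. This is only an illustration under stated assumptions: `naive_prmaj` counts occurrences directly instead of using the structures of Corollary 7, the partition rule (frequency rank $r$ goes to subset $\lfloor \lg r \rfloor$) is assumed, positions are 0-based, and the partition is rebuilt on every call purely to keep the example self-contained.

```python
from collections import Counter

def partition(A):
    """Assumed partition rule: frequency rank r goes to class floor(lg r)."""
    freq = Counter(A)
    order = sorted(freq, key=lambda s: (-freq[s], s))
    cls, local, members = {}, {}, {}
    for r, s in enumerate(order, start=1):
        l = r.bit_length() - 1
        cls[s] = l
        local[s] = len(members.setdefault(l, []))
        members[l].append(s)
    T = [cls[a] for a in A]
    S = {l: [] for l in members}
    for a in A:
        S[cls[a]].append(local[a])
    return T, S, members

def naive_prmaj(X, lo, hi, threshold):
    """Distinct values occurring at least `threshold` times in X[lo..hi]."""
    counts = Counter(X[lo:hi + 1])
    return [v for v, c in counts.items() if c >= threshold]

def two_level_prmaj(A, i, j, tau):
    """Elements occurring at least tau*(j-i+1) times in A[i..j]."""
    T, S, members = partition(A)        # built once in a real structure
    need = tau * (j - i + 1)
    out = []
    for l in naive_prmaj(T, i, j, need):
        i_l = T[:i].count(l)            # 0-based analogue of T.rank_l(i-1) + 1
        j_l = T[:j + 1].count(l) - 1    # 0-based analogue of T.rank_l(j)
        # tau_l * (j_l - i_l + 1) equals tau * (j - i + 1), so the absolute
        # threshold `need` is reused unchanged on S_l.
        for s in naive_prmaj(S[l], i_l, j_l, need):
            out.append(members[l][s])
    return sorted(out)
```

Because $\tau_\ell (j_\ell - i_\ell + 1) = \tau (j - i + 1)$, the second-level queries never need to compute $\tau_\ell$ explicitly in this sketch; it matters only for the running-time analysis.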

► Theorem 15. We can store $A$ in $nH + o(n(H + 1))$ bits such that later, given $i$, $j$ and 
$\tau$, in $O((1/\tau) \lg\lg \sigma)$ time we can list all the distinct elements that occur at least 
$\tau (j - i + 1)$ times in $A[i..j]$. The structure also offers constant-time access to $A$ and 
$O(\lg\lg \sigma)$-time rank and select. 



B Sketch of Fully-Compressed Parameterized Range Minority 



The $O(n)$ term in the space bound of Theorem 14 comes from two places: first, Belazzougui 
and Navarro's [8] set of monotone minimal perfect hash functions, which take $O(n) + o(nH)$ 
bits; second, Fischer's [13] RMQ data structure, which takes $2n + o(n)$ bits. To eliminate 
that term, we first show that we need only $o(n(H + 1))$ bits for the monotone minimal 
perfect hash functions, and then show how to build Fischer's data structure over a sampling 
of $C$. 

Belazzougui and Navarro's [7] data structure supporting access and rank on $A$ is also 
based on alphabet partitioning (see Appendix A). This means their data structure supports 
access, rank and select queries in $O(1)$ time whenever the element involved occurs at least 
$n/\lg n$ times in $A$. It follows that we need to store monotone minimal perfect hash functions 
only for elements occurring fewer than $n/\lg n$ times in $A$, and calculation shows this takes 
a total of $o(n(H + 1))$ bits. We will provide more details in the full version of this paper. 

► Theorem 16. We can store $A$ in $nH + o(n(H + 1))$ bits and support access and select in 
$O(f(n))$ time for any function $f(n) = \omega(1)$, and partial rank in $O(1)$ time. 



Applying Theorem 16 to $A$ with $f(n) = \sqrt{\alpha(n)}$ lets us support access to $C$ in 
$O\left(\sqrt{\alpha(n)}\right)$ time. Let $C'$ be the array defined by 

$$C'[k] = \min C\left[(k - 1)\sqrt{\alpha(n)} + 1 \,..\, k\sqrt{\alpha(n)}\right],$$

which has length $O\left(n/\sqrt{\alpha(n)}\right) = o(n)$. We store an instance of Fischer's RMQ data 
structure for $C'$ in $o(n)$ bits. Given an interval in $C$, we use this data structure to find a 
subinterval of length $\sqrt{\alpha(n)}$ that contains that interval's leftmost minimum. We then 
compute the entries of $C$ in that subinterval and find the exact position of the leftmost 
minimum, in a total of $O(\alpha(n))$ time. 
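
The sampling idea can be illustrated with a minimal Python sketch. The block length `b` plays the role of $\sqrt{\alpha(n)}$, a linear scan over the block minima stands in for Fischer's RMQ structure, positions are 0-based, and the handling of the partial blocks at both ends of the query is an assumption of this sketch rather than something spelled out above.

```python
def build_block_minima(C, b):
    """Cp[k] is the minimum of the k-th block of C, each block having length b."""
    return [min(C[s:s + b]) for s in range(0, len(C), b)]

def leftmost_min(C, Cp, b, lo, hi):
    """0-based position of the leftmost minimum of C[lo..hi]."""
    first_full = -(-lo // b)            # first block entirely inside [lo, hi]
    last_full = (hi + 1) // b - 1       # last block entirely inside [lo, hi]
    if first_full <= last_full:
        # Naive RMQ over the sampled array; ties go to the leftmost block.
        best = min(range(first_full, last_full + 1), key=lambda k: (Cp[k], k))
        scan = (list(range(lo, first_full * b))                 # partial head
                + list(range(best * b, (best + 1) * b))         # best full block
                + list(range((last_full + 1) * b, hi + 1)))     # partial tail
    else:
        scan = range(lo, hi + 1)        # the query spans fewer than 2b entries
    # Only O(b) entries of C are accessed.
    return min(scan, key=lambda p: (C[p], p))
```

With $b = \sqrt{\alpha(n)}$ and $O(\sqrt{\alpha(n)})$-time access to each entry of $C$, the scan costs $O(\alpha(n))$ time in total.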

► Theorem 17. We can store $A$ in $nH + o(n(H + 1))$ bits such that later, given $i$, $j$ and 
$\tau$, in $O(\alpha(n)/\tau)$ time we can find an element that occurs in $A[i..j]$ but at most 
$\tau (j - i + 1)$ times, or determine that no such element exists. 

C A Simple Distance-Sensitive Data Structure 

Assume we have a table X that contains n elements from universe U in sorted order. 

We use $\lg\lg U$ levels. At a level $\ell$, we divide the universe into $\lceil U/2^{2^\ell} \rceil$ overlapping 
intervals, so that interval $k$ will be $[k \cdot 2^{2^\ell} + 1, (k + 2) 2^{2^\ell}]$. We consider separately the 
intervals with even and odd $k$ (we call them even or odd intervals). For each of the two 
categories, the set of intervals will be disjoint. For each category, we use a monotone minimal 
perfect hash function (mmphf) $F_\ell$ that stores the $k$ values corresponding to nonempty 
intervals, and a prefix sum data structure to store the number of elements in each nonempty 
interval $k$, by appending the cardinalities of the intervals in a bitmap $B_\ell$ in unary (i.e., 
cardinality $c$ is stored as $1^{c-1}0$) and using select to get prefix sums. With $F_\ell$ and $B_\ell$ we map 
in constant time from a nonempty interval $k$ to its corresponding area in $X$ (say, $p = F_\ell(k)$, 
then the area is $X[\mathrm{select}_0(B_\ell, p - 1) + 1 \,..\, \mathrm{select}_0(B_\ell, p)]$). Since there are at most $n$ 
nonempty intervals of each category, $F_\ell$ uses $O(n \lg\lg U)$ bits. Bitmap $B_\ell$ uses $O(n)$ bits. 
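
The unary prefix-sum encoding can be illustrated with a minimal Python sketch. The function names are hypothetical, `select0` is a linear scan standing in for a constant-time select structure, and positions in the bitmap and in $X$ are 1-based as in the text.

```python
def encode_unary(cardinalities):
    """Concatenate 1^(c-1) 0 for each cardinality c >= 1."""
    bits = []
    for c in cardinalities:
        bits.extend([1] * (c - 1) + [0])
    return bits

def select0(bits, p):
    """1-based position of the p-th 0 in bits; select0(bits, 0) is 0."""
    if p == 0:
        return 0
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == 0:
            seen += 1
            if seen == p:
                return pos
    raise ValueError("fewer than p zeros in the bitmap")

def area(bits, p):
    """1-based range of X occupied by the p-th nonempty interval."""
    return select0(bits, p - 1) + 1, select0(bits, p)
```

Since each cardinality $c$ contributes exactly $c$ bits, the position of the $p$th 0 equals the sum of the first $p$ cardinalities, which is why a single select yields a prefix sum.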

In addition, for each nonempty interval with more than $2^\ell$ elements, we store a local 
predecessor search data structure (lpsds). The lpsds of an interval samples one every $2^\ell$ 
elements in the interval and stores them in a local y-fast trie. The y-fast trie of elements 
$x \in [k \cdot 2^{2^\ell} + 1, (k + 2) 2^{2^\ell}]$ will store $x - k \cdot 2^{2^\ell} - 1$, and thus will range over a universe 
of size $O(2^{2^\ell})$. Since they store, in total, $O(n/2^\ell)$ elements over a universe of size $O(2^{2^\ell})$, 
the space of all the lpsds adds up to $O(n)$ bits. 
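
The role of the samples can be illustrated with a minimal Python sketch. A sorted list searched with `bisect` stands in for the y-fast trie, `step` plays the role of $2^\ell$, and the function names are hypothetical.

```python
from bisect import bisect_right

def build_samples(block, step):
    """Keep one element out of every `step` elements of the sorted block."""
    return block[::step]

def predecessor_in_block(block, samples, step, x):
    """Largest element of the sorted block that is <= x, or None."""
    s = bisect_right(samples, x) - 1    # stand-in for the y-fast trie query
    if s < 0:
        return None                     # x precedes every element
    lo = s * step
    hi = min(lo + step, len(block))
    # The answer lies among the `step` elements that follow the sample.
    p = bisect_right(block, x, lo, hi) - 1
    return block[p]
```

The final binary search touches at most `step` elements, so with `step` equal to $2^\ell$ it takes $O(\ell)$ time.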

Since the lpsds storing $m_r$ elements uses at most $c\,m_r$ bits, for some constant $c$, we store 
them one after the other, reserving $c\,m_r$ bits for each lpsds storing $m_r$ elements. We store 
a bitmap $P_\ell$ as a partial sum data structure on the $m_r$ values, concatenating $1^{m_r}0$ (assume 
$m_r = 0$ if there are fewer than $2^\ell$ elements and thus no lpsds is stored). We can find in 
constant time, using select on $P_\ell$, the starting point of each lpsds. $P_\ell$ uses at most $O(n)$ bits. 

Then to do predecessor search on interval $[i, j]$ we proceed as follows: 

1. We compute $\ell = \lceil \lg\lg (j - i + 1) \rceil$, so that the query is for sure contained in an (even or 
odd) interval $k$ of level $\ell$. Number $k$ is found algebraically in constant time. 

2. We use $F_\ell$ to map $k$ to its position $p$ in the nonempty intervals, and then $B_\ell$ to find the 
corresponding range $X[i_k..j_k]$. This takes constant time. 

3. If $j_k - i_k + 1 < 2^\ell$, we complete the query with a binary search on $X[i_k..j_k]$, in time 
$O(\ell)$, and finish. 

4. We use the local predecessor search data structure of interval $p$, found using $P_\ell$, to 
determine the subinterval $[i'_k..j'_k] \subseteq [i_k..j_k]$ of size $2^\ell$ where the answer lies. This takes 
time $O(\lg\lg 2^{2^\ell}) = O(\ell)$. 

5. We complete the query using binary search on $X[i'_k..j'_k]$, in time $O(\ell)$. 
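
Step 1 can be illustrated with a minimal Python sketch. The formula $k = \lfloor (i - 1)/2^{2^\ell} \rfloor$ is one possible way to find the interval algebraically and is an assumption of this sketch, as is the function name; the loop stands in for a constant-time computation of $\ell$ and also covers the degenerate case $j = i$.

```python
def level_and_interval(i, j):
    """Return (l, k) such that [i, j] lies inside interval k of level l.

    Interval k of level l is [k * w + 1, (k + 2) * w] with w = 2^(2^l).
    """
    length = j - i + 1
    l = 0
    while (1 << (1 << l)) < length:     # smallest l with 2^(2^l) >= length
        l += 1
    w = 1 << (1 << l)
    k = (i - 1) // w                    # assumed choice: k*w + 1 <= i <= (k+1)*w
    return l, k
```

For example, the query $[10, 25]$ has length 16, so $\ell = 2$ and $k = 0$, and it is contained in the interval $[1, 32]$. Because consecutive intervals overlap by half their length, any query of length at most $2^{2^\ell}$ fits entirely inside the interval chosen this way.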

Our data structures use in total $O(n \lg\lg U)$ bits for a given level $\ell$, which adds up to 
$O(n (\lg\lg U)^2)$ bits in total. They answer in time $O(\ell) = O(\lg\lg (j - i + 1))$ on nonempty 
intervals. Note that on empty intervals our mmphf $F_\ell$ could return an arbitrary value. 



